In today’s data-driven world, job roles related to data are not only proliferating but also evolving at a rapid pace. From data scientists and data engineers to analysts and architects, the spectrum of data-related professions continues to expand. As these roles become more nuanced and specialized, determining a competitive salary becomes a challenge for both employers and potential employees.
Enter “Glassdoor’s Data Jobs- Salary Predictor” - our state-of-the-art predictive model designed to address this very challenge. Leveraging vast amounts of salary data available on Glassdoor, one of the world’s largest job and recruiting platforms, our model hopes to accurate salary predictions for various data-centric job roles.
In the following sections, we will delve deeper into the mechanics of our model, the data that powers it, and the actionable insights it can provide. Let’s embark on this journey to demystify the world of data job salaries on Glassdoor.
Objective 1: Build a model with the main goal to identify key relationships between the average salary and the job descriptions, such as the required skills, job titles, job locations, etc. and more importantly, is highly interpretable.
Objective 2: Build multiple models with the goal of developing a model that can predict the best and do well on future data.
Our data set was from Kaggle. It included 742 job postings in 2021, which scraped from Glassdoor using a Selenium-based scraper. After scrapping, the raw data set was split into 42 columns, including the original job posts.
#Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(naniar)
setwd("~/Desktop/MSDS/Term2_Fall2023/Stat2/Projects/Project1/R")
data<-read.csv("data_cleaned_2021.csv")
#Checking data types
str(data)
## 'data.frame': 742 obs. of 42 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Job.Title : chr "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
## $ Salary.Estimate : chr "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
## $ Job.Description : chr "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Company.Name : chr "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
## $ Location : chr "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
## $ Headquarters : chr "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
## $ Size : chr "501 - 1000 " "10000+ " "501 - 1000 " "1001 - 5000 " ...
## $ Founded : int 1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
## $ Type.of.ownership : chr "Company - Private" "Other Organization" "Company - Private" "Government" ...
## $ Industry : chr "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Sector : chr "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
## $ Revenue : chr "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ Competitors : chr "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
## $ Hourly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Employer.provided : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Lower.Salary : int 53 63 80 56 86 71 54 86 38 120 ...
## $ Upper.Salary : int 91 112 90 97 143 119 93 142 84 160 ...
## $ Avg.Salary.K. : num 72 87.5 85 76.5 114.5 ...
## $ company_txt : chr "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
## $ Job.Location : chr "NM" "MD" "FL" "WA" ...
## $ Age : int 48 37 11 56 23 21 13 16 7 12 ...
## $ Python : int 1 1 1 1 1 1 0 1 0 1 ...
## $ spark : int 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : int 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : int 1 0 1 0 1 1 1 1 0 0 ...
## $ sql : int 0 0 1 0 1 1 0 1 0 0 ...
## $ sas : int 1 0 1 0 1 0 0 0 0 0 ...
## $ keras : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pytorch : int 0 0 0 0 0 0 0 1 0 0 ...
## $ scikit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tensor : int 0 0 0 0 0 0 0 1 0 0 ...
## $ hadoop : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tableau : int 1 0 0 0 0 0 0 0 0 0 ...
## $ bi : int 1 0 0 0 0 1 0 0 0 0 ...
## $ flink : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mongo : int 0 0 0 0 0 1 0 0 0 0 ...
## $ google_an : int 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title_sim : chr "data scientist" "data scientist" "data scientist" "data scientist" ...
## $ seniority_by_title: chr "na" "na" "na" "na" ...
## $ Degree : chr "M" "M" "M" "na" ...
vis_miss(data)
dim(data)
## [1] 742 42
summary(data)
## index Job.Title Salary.Estimate Job.Description
## Min. : 0.0 Length:742 Length:742 Length:742
## 1st Qu.:221.5 Class :character Class :character Class :character
## Median :472.5 Mode :character Mode :character Mode :character
## Mean :469.1
## 3rd Qu.:707.8
## Max. :955.0
## Rating Company.Name Location Headquarters
## Min. :-1.000 Length:742 Length:742 Length:742
## 1st Qu.: 3.300 Class :character Class :character Class :character
## Median : 3.700 Mode :character Mode :character Mode :character
## Mean : 3.619
## 3rd Qu.: 4.000
## Max. : 5.000
## Size Founded Type.of.ownership Industry
## Length:742 Min. : -1 Length:742 Length:742
## Class :character 1st Qu.:1939 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1837
## 3rd Qu.:2007
## Max. :2019
## Sector Revenue Competitors Hourly
## Length:742 Length:742 Length:742 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Mode :character Median :0.00000
## Mean :0.03234
## 3rd Qu.:0.00000
## Max. :1.00000
## Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## Min. :0.00000 Min. : 15.00 Min. : 16.0 Min. : 15.5
## 1st Qu.:0.00000 1st Qu.: 52.00 1st Qu.: 96.0 1st Qu.: 73.5
## Median :0.00000 Median : 69.50 Median :124.0 Median : 97.5
## Mean :0.02291 Mean : 74.75 Mean :128.2 Mean :101.5
## 3rd Qu.:0.00000 3rd Qu.: 91.00 3rd Qu.:155.0 3rd Qu.:122.5
## Max. :1.00000 Max. :202.00 Max. :306.0 Max. :254.0
## company_txt Job.Location Age Python
## Length:742 Length:742 Min. : -1.00 Min. :0.0000
## Class :character Class :character 1st Qu.: 12.00 1st Qu.:0.0000
## Mode :character Mode :character Median : 25.00 Median :1.0000
## Mean : 47.52 Mean :0.5283
## 3rd Qu.: 60.00 3rd Qu.:1.0000
## Max. :277.00 Max. :1.0000
## spark aws excel sql
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.2251 Mean :0.2372 Mean :0.5229 Mean :0.5121
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## sas keras pytorch scikit
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.08895 Mean :0.03908 Mean :0.05256 Mean :0.07278
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## tensor hadoop tableau bi
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.09703 Mean :0.1671 Mean :0.1995 Mean :0.07547
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## flink mongo google_an job_title_sim
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Length:742
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 Class :character
## Median :0.00000 Median :0.00000 Median :0.00000 Mode :character
## Mean :0.01348 Mean :0.04987 Mean :0.01887
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## seniority_by_title Degree
## Length:742 Length:742
## Class :character Class :character
## Mode :character Mode :character
##
##
##
After evaluating all of the variables,we decided to drop some of them because of their existence lead to the redundancy of other variables. For example, the job title was a detailed job title from the job postings. Most of them are unique and would not provide a meaningful prediction.
The Salary Estimates is the salary ranged provided in the job listings, based on the Salary estimates, we got the upper and the slower salary and thus, find the average salary by dividing the sum of the upper and the slower salary by 2. That is why we removed all of the variables relating to the salary except the average salary. We also removed the other variables on the list for the same reason.
We moved forward with 28 variables. Following are the variable that were dropped from all data set. 1 - index 2 - Job.Title (the same as job_title_sim) 3 - Salary Estimate 4 - Job Description (Glassdor job posts) 6 - Company Name 7 - Location 8 - Headquarter 10 - Founded (Year that company was founded) 12 - Industry 15 - Competitors 17 - Employer Provided 18 - Lower Salary 19 - Upper Salary 21 - Company text
data <- data[ , -c(1,2,3,4,6,7,8,10,12,15,17,18,19,21)]
#Find the na values
columns_with_na <- names(data)[colSums(is.na(data)) > 0]
# Display the column names with NA values
print(columns_with_na)
## character(0)
There is no null values based on the initial checking point.
##Factorize the categorized variables
data <- data %>% mutate_if(is.character, factor)
# Replace "na" levels with "not specified" using fct_recode()
data$job_title_sim <- fct_recode(data$job_title_sim, "other data related jobs" = "na")
There are 10 variables that the job_stitle_sim are “na”. Upon checking the original data sets, these job postings does relate to data analysis even though their job titles do not contain “data”. We decided to keep the data points but create a new level for the job title, called “other data related jobs.
# Replace "-1" levels with "not specified" using fct_recode()
data$Sector <- fct_recode(data$Sector, "not specified" = "-1")
There are 10 variables that the sector are “na”. We decided to keep the data points but create a new level for missing sector, called “not specified”.
# Assuming 'data' is your data frame
# Use colnames() to change the column name
colnames(data)[colnames(data) == "Type.of.ownership"] <- "Ownership_type"
colnames(data)[colnames(data) == "Age"] <- "Comp_history"
colnames(data)[colnames(data) == "Avg.Salary.K."] <- "AvgSalary"
colnames(data)[colnames(data) == "job_title_sim"] <- "Job_title"
colnames(data)[colnames(data) == "seniority_by_title"] <- "Experience"
# Replace "-1" levels with "not specified" using fct_recode()
data$Experience <- fct_recode(data$Experience, "no experience required" = "na")
There are 519 job postings do not mention experiences. We assume that job does not limit applicants from a specific level of experience. So we simply replace the “na” value with “no experience required”.
#Corrplot
# Select only numeric columns from the data frame
numeric_data <- select_if(data, is.numeric)
# Compute the correlation matrix for numeric variables
cor_matrix <- cor(numeric_data)
# Customize corrplot (e.g., change color, method, etc.)
# For more customization options, check the documentation: ?corrplot
corrplot(cor_matrix, method = "color", tl.cex = 0.7)
Notably, keras, pytorch, scikit and tensor has high multi-collinearity. This is expected as keras and pytorch are the features built on top of tensor flow. Therefore, if a job require one of these skills, there are high chance that the employee will also need to be familiar with the other skills to do their jobs efficiently. To reduce the dependency between these variables, we combine it as one variable called “ML_skills”. Similarly, we combined Hadoop and spark as “data_framework”; tableau and bi were combined as data_visualization.
# Create a new column by summing the four columns
data$ML_skills <- data$keras + data$pytorch + data$scikit+ data$tensor
data$vizualization <- ifelse(data$bi == 1 & data$tableau == 1, "both",
ifelse(data$bi == 1 & data$tableau == 0, "bi",
ifelse(data$bi == 0 & data$tableau == 1, "tableau", "none")))
data$data_framework <- ifelse(data$spark == 1 & data$hadoop == 1, "both",
ifelse(data$spark == 1 & data$hadoop == 0, "spark",
ifelse(data$spark == 0 & data$hadoop == 1, "hadoop", "none")))
11 - spark
16 - pytorch 17 - keras 18 - scikit 19 - tensorflow 20 - tableau 21 - bi
22 - hadoop
We removed the original variables of the skills mentioned above.
# data <- data[ , -c(25,30,31,32,33,34,35,36)]
data <- data[ , -c(11,16,17,18,19,20,21,22)]
#Checking multicollinearity
#Corrplot
# Select only numeric columns from the data frame
numeric_data <- select_if(data, is.numeric)
# Compute the correlation matrix for numeric variables
cor_matrix <- cor(numeric_data)
# Customize corrplot (e.g., change color, method, etc.)
# For more customization options, check the documentation: ?corrplot
corrplot(cor_matrix, method = "color", tl.cex = 0.7)
Notice now the multicollinearity is improved. We noticed that among the numerical variables, the highest correlation is between the average salary and Python (0.324)
#Display the counts of each level of the categorial varibles
for (col in names(data)) {
if (is.factor(data[[col]])) {
cat("Variable:", col, "\n")
cat(table(data[[col]]), "\n\n")
}
}
## Variable: Size
## 31 130 150 117 76 134 94 10
##
## Variable: Ownership_type
## 13 410 193 15 15 55 5 2 34
##
## Variable: Sector
## 10 1 25 1 4 112 97 3 4 23 42 11 49 180 69 34 6 3 9 14 8 15 6 8 8
##
## Variable: Revenue
## 60 8 32 124 91 39 40 19 18 46 57 4 204
##
## Variable: Job.Location
## 8 9 152 11 5 11 6 16 6 5 2 40 10 3 6 4 103 35 6 2 9 21 4 17 3 72 14 4 33 1 1 13 28 10 41 21 10
##
## Variable: Job_title
## 101 8 119 5 313 16 5 22 10 143
##
## Variable: Experience
## 3 519 220
##
## Variable: Degree
## 252 383 107
# Remove rows where Rating is -1 using logical indexing
data <- data[data$AvgSalary != 15.5, ]
#Replace the not Specific Company Age as Median
# Calculate the median of comp_age excluding -1 values
median_age <- median(data$Comp_history[data$Comp_history != -1])
# Replace -1 with the calculated median
data$Comp_history[data$Comp_history == -1] <- median_age
median_age
## [1] 29
Upon checking the outliers, we removed some of the outliers and missing variables. We removed the job positing from NYPD which is a data scientist full time job that only require excel skills and the average salary is 15.5K annually. Compared to the cost of living in New York an the job is full time, we believe there must be some error in this job postings. We don’t think it is possible for a job like this exist with this average salary.
We also removed the job that has rating is -1 and company size is unknown, because they don’t provide any information except the salary.
##Factorize the categorized variables
data <- data %>% mutate_if(is.character, factor)
Right-skewed as expected for salary.
hist(data$AvgSalary, breaks = 20, col = "lightblue", xlab = "Average Salary", main = "Histogram of AvgSalary")
#Rating vs Average Salary by the Job titles
# Scatter Plot- Original Data
ggplot(data, aes(x=Rating, y = AvgSalary, color= Job_title)) +
geom_point() +
labs(title = "Scatter Plot of Rating vs. Average Salary")
# Scatter Plot Linear -Log
ggplot(data, aes(x=Rating, y = log(AvgSalary), color= Job_title)) +
geom_point() +
labs(title = "Scatter Plot of Rating vs. Average Salary")
# Company Age vs. Average Salary by the Job titles
# Scatter Plot - Original Data
ggplot(data, aes(x=Comp_history, y = AvgSalary, color=Ownership_type)) +
geom_point() +
labs(title = "Scatter Plot of Comp_History vs. Average Salary")
# Scatter Plot - Log- Linear
ggplot(data, aes(x=log(Comp_history), y = AvgSalary, color=Ownership_type)) +
geom_point() +
labs(title = "Scatter Plot of Comp_History vs. Average Salary")
# Scatter Plot Linear-Log
ggplot(data, aes(x=Comp_history, y = log(AvgSalary), color=Ownership_type)) +
geom_point() +
labs(title = "Scatter Plot of Comp_History vs. Average Salary")
# Scatter Plot Log- log
ggplot(data, aes(x=log(Comp_history), y = log(AvgSalary), color=Ownership_type)) +
geom_point() +
labs(title = "Scatter Plot of Comp_History vs. Average Salary")
## Ml_skills vs. Average Salary by job titles
ggplot(data, aes(x=ML_skills, y =AvgSalary, color= Job_title)) + geom_point()
We noticed that the log transformation improved the scatter plot. We moved forward with the log transformation with our initial model
ggpairs(data[,c(1,2,3,5,6,7)], mapping=ggplot2::aes())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(data[,c(9,10,11,12,13,7)], mapping=ggplot2::aes())
ggpairs(data[,c(14,15,16,17,18,19,7)], mapping=ggplot2::aes())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(data[,c(20,21,22,23,7)], mapping=ggplot2::aes())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Fit the model
fit = lm(AvgSalary~., data = data)
summary(fit)
##
## Call:
## lm(formula = AvgSalary ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.571 -12.905 -1.037 10.602 108.462
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 44.73837 32.11109 1.393
## Rating 1.73824 1.87625 0.926
## Size10000+ -0.53898 8.60715 -0.063
## Size1001 - 5000 4.05616 7.24471 0.560
## Size201 - 500 0.25156 6.92637 0.036
## Size5001 - 10000 -5.36205 8.51146 -0.630
## Size501 - 1000 -1.74591 6.97613 -0.250
## Size51 - 200 5.63115 6.66380 0.845
## Sizeunknown 30.11390 16.15176 1.864
## Ownership_typeCompany - Private -15.11921 14.26265 -1.060
## Ownership_typeCompany - Public -12.50770 14.71616 -0.850
## Ownership_typeGovernment -18.67891 21.07024 -0.887
## Ownership_typeHospital -41.76636 17.51145 -2.385
## Ownership_typeNonprofit Organization -28.95782 14.64491 -1.977
## Ownership_typeOther Organization -18.32663 18.06747 -1.014
## Ownership_typeSchool / School District -24.69313 22.53815 -1.096
## Ownership_typeSubsidiary or Business Segment -10.58444 14.93809 -0.709
## SectorAccounting & Legal -12.61080 31.33039 -0.403
## SectorAerospace & Defense -5.84538 18.81672 -0.311
## SectorAgriculture & Forestry -14.89962 33.02559 -0.451
## SectorArts, Entertainment & Recreation -36.69274 24.43375 -1.502
## SectorBiotech & Pharmaceuticals -5.49146 17.39556 -0.316
## SectorBusiness Services -11.34504 18.01210 -0.630
## SectorConstruction, Repair & Maintenance -15.38139 24.23122 -0.635
## SectorConsumer Services -15.35573 22.38271 -0.686
## SectorEducation -24.82890 19.69335 -1.261
## SectorFinance -8.92907 18.12152 -0.493
## SectorGovernment -16.77857 19.69248 -0.852
## SectorHealth Care -0.53742 17.95268 -0.030
## SectorInformation Technology -1.38739 17.54772 -0.079
## SectorInsurance -3.90119 17.45519 -0.223
## SectorManufacturing -20.11386 18.36815 -1.095
## SectorMedia -11.30796 21.02874 -0.538
## SectorMining & Metals -32.28058 24.01207 -1.344
## SectorNon-Profit -9.81162 21.55776 -0.455
## SectorOil, Gas, Energy & Utilities -29.88576 23.92769 -1.249
## SectorReal Estate -19.70168 20.72451 -0.951
## SectorRetail -6.01066 18.57948 -0.324
## SectorTelecommunications -16.04517 21.25466 -0.755
## SectorTransportation & Logistics -17.16275 20.25440 -0.847
## SectorTravel & Tourism -18.67992 21.10833 -0.885
## Revenue$1 to $5 million (USD) -20.26773 11.92480 -1.700
## Revenue$10 to $25 million (USD) -11.37047 7.35221 -1.547
## Revenue$10+ billion (USD) 10.20060 6.73155 1.515
## Revenue$100 to $500 million (USD) 0.03939 5.69592 0.007
## Revenue$2 to $5 billion (USD) -4.33756 6.53917 -0.663
## Revenue$25 to $50 million (USD) -2.03401 7.60006 -0.268
## Revenue$5 to $10 billion (USD) -3.84303 8.41960 -0.456
## Revenue$5 to $10 million (USD) 28.16622 10.12453 2.782
## Revenue$50 to $100 million (USD) 6.18520 6.68587 0.925
## Revenue$500 million to $1 billion (USD) -3.24564 6.62245 -0.490
## RevenueLess than $1 million (USD) -4.19414 17.35186 -0.242
## RevenueUnknown / Non-Applicable 0.63990 5.29845 0.121
## Hourly -13.29510 9.32546 -1.426
## Job.LocationAZ 2.86966 13.82039 0.208
## Job.LocationCA 47.15945 10.82261 4.357
## Job.LocationCO 31.78393 13.83040 2.298
## Job.LocationCT 0.61671 17.09251 0.036
## Job.LocationDC 37.01444 13.57808 2.726
## Job.LocationDE 16.79117 18.19177 0.923
## Job.LocationFL 9.98537 12.29349 0.812
## Job.LocationGA -3.39986 14.83153 -0.229
## Job.LocationIA 20.95790 16.11890 1.300
## Job.LocationID 17.24120 21.09292 0.817
## Job.LocationIL 37.28471 11.75181 3.173
## Job.LocationIN 6.68039 13.87555 0.481
## Job.LocationKS -7.44824 21.26598 -0.350
## Job.LocationKY 50.35207 16.33637 3.082
## Job.LocationLA 9.08656 17.28016 0.526
## Job.LocationMA 22.13743 11.29726 1.960
## Job.LocationMD 22.34487 11.03884 2.024
## Job.LocationMI 37.64724 15.75463 2.390
## Job.LocationMN 39.84411 21.05357 1.893
## Job.LocationMO 34.72739 14.33060 2.423
## Job.LocationNC 21.80533 12.71251 1.715
## Job.LocationNE 10.35477 17.75121 0.583
## Job.LocationNJ 40.66820 12.64037 3.217
## Job.LocationNM -16.19410 18.35588 -0.882
## Job.LocationNY 28.90479 11.20980 2.579
## Job.LocationOH 25.25122 12.97189 1.947
## Job.LocationOR 18.65677 17.06833 1.093
## Job.LocationPA 20.02570 12.35532 1.621
## Job.LocationRI 41.11211 28.75606 1.430
## Job.LocationSC 8.63612 27.28213 0.317
## Job.LocationTN 3.62162 13.25275 0.273
## Job.LocationTX 22.28932 12.00974 1.856
## Job.LocationUT 44.91421 16.37831 2.742
## Job.LocationVA 19.77194 11.12446 1.777
## Job.LocationWA 24.18280 13.40694 1.804
## Job.LocationWI 15.02319 14.17840 1.060
## Comp_history 0.05272 0.03040 1.734
## Python 9.25073 2.66096 3.476
## aws 1.25197 2.73394 0.458
## excel 0.29790 2.21488 0.134
## sql -7.45072 2.80046 -2.661
## sas 13.14897 4.14333 3.174
## flink -2.92467 9.30753 -0.314
## mongo 16.14220 5.28079 3.057
## google_an -6.30308 9.75108 -0.646
## Job_titledata analitics 0.96320 10.99728 0.088
## Job_titledata engineer 29.02483 4.56200 6.362
## Job_titledata modeler 19.83431 13.06796 1.518
## Job_titledata scientist 35.82931 3.99772 8.962
## Job_titleData scientist project manager 3.77889 7.80475 0.484
## Job_titledirector 55.81974 12.23907 4.561
## Job_titlemachine learning engineer 51.51131 7.87601 6.540
## Job_titleother data related jobs 5.61757 9.75807 0.576
## Job_titleother scientist 16.46659 5.24085 3.142
## Experienceno experience required -10.52320 15.78448 -0.667
## Experiencesr 14.00099 15.90342 0.880
## Degreena 3.24468 2.54486 1.275
## DegreeP 7.87797 3.88378 2.028
## ML_skills 0.41671 1.51747 0.275
## vizualizationboth 15.44948 11.05894 1.397
## vizualizationnone 10.82997 10.53173 1.028
## vizualizationtableau 4.19043 10.90814 0.384
## data_frameworkhadoop 10.01528 6.00461 1.668
## data_frameworknone 2.85529 3.88756 0.734
## data_frameworkspark 3.86361 4.53445 0.852
## Pr(>|t|)
## (Intercept) 0.164046
## Rating 0.354574
## Size10000+ 0.950090
## Size1001 - 5000 0.575764
## Size201 - 500 0.971039
## Size5001 - 10000 0.528939
## Size501 - 1000 0.802462
## Size51 - 200 0.398416
## Sizeunknown 0.062731 .
## Ownership_typeCompany - Private 0.289531
## Ownership_typeCompany - Public 0.395691
## Ownership_typeGovernment 0.375687
## Ownership_typeHospital 0.017374 *
## Ownership_typeNonprofit Organization 0.048445 *
## Ownership_typeOther Organization 0.310813
## Ownership_typeSchool / School District 0.273671
## Ownership_typeSubsidiary or Business Segment 0.478867
## SectorAccounting & Legal 0.687447
## SectorAerospace & Defense 0.756172
## SectorAgriculture & Forestry 0.652036
## SectorArts, Entertainment & Recreation 0.133676
## SectorBiotech & Pharmaceuticals 0.752350
## SectorBusiness Services 0.529020
## SectorConstruction, Repair & Maintenance 0.525808
## SectorConsumer Services 0.492935
## SectorEducation 0.207862
## SectorFinance 0.622375
## SectorGovernment 0.394526
## SectorHealth Care 0.976128
## SectorInformation Technology 0.937007
## SectorInsurance 0.823222
## SectorManufacturing 0.273923
## SectorMedia 0.590950
## SectorMining & Metals 0.179326
## SectorNon-Profit 0.649173
## SectorOil, Gas, Energy & Utilities 0.212134
## SectorReal Estate 0.342153
## SectorRetail 0.746417
## SectorTelecommunications 0.450594
## SectorTransportation & Logistics 0.397121
## SectorTravel & Tourism 0.376523
## Revenue$1 to $5 million (USD) 0.089701 .
## Revenue$10 to $25 million (USD) 0.122483
## Revenue$10+ billion (USD) 0.130194
## Revenue$100 to $500 million (USD) 0.994485
## Revenue$2 to $5 billion (USD) 0.507371
## Revenue$25 to $50 million (USD) 0.789072
## Revenue$5 to $10 billion (USD) 0.648234
## Revenue$5 to $10 million (USD) 0.005567 **
## Revenue$50 to $100 million (USD) 0.355265
## Revenue$500 million to $1 billion (USD) 0.624239
## RevenueLess than $1 million (USD) 0.809083
## RevenueUnknown / Non-Applicable 0.903911
## Hourly 0.154463
## Job.LocationAZ 0.835578
## Job.LocationCA 1.54e-05 ***
## Job.LocationCO 0.021886 *
## Job.LocationCT 0.971229
## Job.LocationDC 0.006590 **
## Job.LocationDE 0.356361
## Job.LocationFL 0.416960
## Job.LocationGA 0.818764
## Job.LocationIA 0.194012
## Job.LocationID 0.414017
## Job.LocationIL 0.001585 **
## Job.LocationIN 0.630366
## Job.LocationKS 0.726276
## Job.LocationKY 0.002146 **
## Job.LocationLA 0.599188
## Job.LocationMA 0.050496 .
## Job.LocationMD 0.043376 *
## Job.LocationMI 0.017164 *
## Job.LocationMN 0.058887 .
## Job.LocationMO 0.015664 *
## Job.LocationNC 0.086794 .
## Job.LocationNE 0.559884
## Job.LocationNJ 0.001361 **
## Job.LocationNM 0.377994
## Job.LocationNY 0.010151 *
## Job.LocationOH 0.052031 .
## Job.LocationOR 0.274789
## Job.LocationPA 0.105564
## Job.LocationRI 0.153309
## Job.LocationSC 0.751692
## Job.LocationTN 0.784734
## Job.LocationTX 0.063935 .
## Job.LocationUT 0.006277 **
## Job.LocationVA 0.076001 .
## Job.LocationWA 0.071754 .
## Job.LocationWI 0.289746
## Comp_history 0.083402 .
## Python 0.000543 ***
## aws 0.647159
## excel 0.893052
## sql 0.008003 **
## sas 0.001580 **
## flink 0.753454
## mongo 0.002333 **
## google_an 0.518260
## Job_titledata analitics 0.930234
## Job_titledata engineer 3.86e-10 ***
## Job_titledata modeler 0.129578
## Job_titledata scientist < 2e-16 ***
## Job_titleData scientist project manager 0.628430
## Job_titledirector 6.14e-06 ***
## Job_titlemachine learning engineer 1.28e-10 ***
## Job_titleother data related jobs 0.565037
## Job_titleother scientist 0.001758 **
## Experienceno experience required 0.505224
## Experiencesr 0.378996
## Degreena 0.202787
## DegreeP 0.042942 *
## ML_skills 0.783710
## vizualizationboth 0.162908
## vizualizationnone 0.304200
## vizualizationtableau 0.700994
## data_frameworkhadoop 0.095833 .
## data_frameworknone 0.462940
## data_frameworkspark 0.394511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.57 on 622 degrees of freedom
## Multiple R-squared: 0.6366, Adjusted R-squared: 0.5677
## F-statistic: 9.236 on 118 and 622 DF, p-value: < 2.2e-16
# QQ plot of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))
# Diagnostic plots
par(mfrow = c(2, 3))
# Residuals vs Fitted Values
plot(fit, which = 1)
# Scale-Location (also called Spread-Location)
plot(fit, which = 3)
## Warning: not plotting observations with leverage one:
## 290, 449
# Cook's distance plot
plot(fit, which = 4)
# Residuals vs Leverage
plot(fit, which = 5)
## Warning: not plotting observations with leverage one:
## 290, 449
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
# Histogram of residuals
hist(resid(fit))
Independence: We assume that each job posting on Glassdor is
independent.
Linearity: Even though the linearity between the salary an the predictors individual not very strong (0.324 as the highest correlation), we assume that there is a linearity between the average salary with all of the predictors remained in the model.
Normality: is a little right skewed but not a big concern here.
Homoscedasticity: We can see that the residuals do not form a random clouds in the residual vs. fitted plot. There are evidence of the non constant variances. Therefore, we will continue with the log transformation of the average salary.
#Fit the model
data$log_Comp_history <-log(data$Comp_history)
logfit = lm(log(AvgSalary)~. -Comp_history, data = data)
summary(logfit)
##
## Call:
## lm(formula = log(AvgSalary) ~ . - Comp_history, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70870 -0.12663 -0.00022 0.13511 0.68143
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.983480 0.303570 13.122
## Rating 0.031741 0.017153 1.850
## Size10000+ -0.070397 0.078673 -0.895
## Size1001 - 5000 -0.005350 0.066295 -0.081
## Size201 - 500 -0.018578 0.063368 -0.293
## Size5001 - 10000 -0.080856 0.077686 -1.041
## Size501 - 1000 -0.067763 0.063739 -1.063
## Size51 - 200 0.022459 0.061006 0.368
## Sizeunknown 0.328132 0.148097 2.216
## Ownership_typeCompany - Private -0.252642 0.131799 -1.917
## Ownership_typeCompany - Public -0.195259 0.135518 -1.441
## Ownership_typeGovernment -0.217954 0.193397 -1.127
## Ownership_typeHospital -0.439855 0.160922 -2.733
## Ownership_typeNonprofit Organization -0.386823 0.134372 -2.879
## Ownership_typeOther Organization -0.307498 0.166020 -1.852
## Ownership_typeSchool / School District -0.292904 0.206609 -1.418
## Ownership_typeSubsidiary or Business Segment -0.233690 0.137505 -1.700
## SectorAccounting & Legal -0.124767 0.286391 -0.436
## SectorAerospace & Defense -0.089084 0.172109 -0.518
## SectorAgriculture & Forestry -0.101400 0.302285 -0.335
## SectorArts, Entertainment & Recreation -0.488445 0.223348 -2.187
## SectorBiotech & Pharmaceuticals -0.073336 0.159074 -0.461
## SectorBusiness Services -0.158007 0.164640 -0.960
## SectorConstruction, Repair & Maintenance -0.103899 0.221951 -0.468
## SectorConsumer Services -0.152161 0.204661 -0.743
## SectorEducation -0.296657 0.180198 -1.646
## SectorFinance -0.156146 0.165662 -0.943
## SectorGovernment -0.184666 0.180021 -1.026
## SectorHealth Care -0.029150 0.164200 -0.178
## SectorInformation Technology -0.084921 0.160509 -0.529
## SectorInsurance -0.090207 0.159545 -0.565
## SectorManufacturing -0.208316 0.167914 -1.241
## SectorMedia -0.144420 0.192204 -0.751
## SectorMining & Metals -0.282324 0.218630 -1.291
## SectorNon-Profit -0.132115 0.197121 -0.670
## SectorOil, Gas, Energy & Utilities -0.335708 0.218696 -1.535
## SectorReal Estate -0.206678 0.189530 -1.090
## SectorRetail -0.055485 0.169846 -0.327
## SectorTelecommunications -0.265296 0.193986 -1.368
## SectorTransportation & Logistics -0.166545 0.185119 -0.900
## SectorTravel & Tourism -0.202046 0.192967 -1.047
## Revenue$1 to $5 million (USD) -0.204127 0.109236 -1.869
## Revenue$10 to $25 million (USD) -0.139671 0.067201 -2.078
## Revenue$10+ billion (USD) 0.113521 0.060257 1.884
## Revenue$100 to $500 million (USD) -0.011986 0.052135 -0.230
## Revenue$2 to $5 billion (USD) -0.072780 0.060307 -1.207
## Revenue$25 to $50 million (USD) -0.060709 0.069473 -0.874
## Revenue$5 to $10 billion (USD) -0.052888 0.076841 -0.688
## Revenue$5 to $10 million (USD) 0.225070 0.092541 2.432
## Revenue$50 to $100 million (USD) 0.024643 0.061147 0.403
## Revenue$500 million to $1 billion (USD) -0.052393 0.060660 -0.864
## RevenueLess than $1 million (USD) -0.060789 0.158634 -0.383
## RevenueUnknown / Non-Applicable -0.015636 0.048875 -0.320
## Hourly -0.352789 0.085077 -4.147
## Job.LocationAZ 0.088787 0.126397 0.702
## Job.LocationCA 0.491711 0.098913 4.971
## Job.LocationCO 0.393735 0.126573 3.111
## Job.LocationCT 0.085571 0.156168 0.548
## Job.LocationDC 0.364760 0.124101 2.939
## Job.LocationDE 0.382696 0.166284 2.301
## Job.LocationFL 0.127096 0.112536 1.129
## Job.LocationGA 0.049514 0.135688 0.365
## Job.LocationIA 0.187296 0.147358 1.271
## Job.LocationID 0.206046 0.192813 1.069
## Job.LocationIL 0.371819 0.107428 3.461
## Job.LocationIN 0.131656 0.127039 1.036
## Job.LocationKS 0.068822 0.194333 0.354
## Job.LocationKY 0.587230 0.149323 3.933
## Job.LocationLA 0.102116 0.158319 0.645
## Job.LocationMA 0.291315 0.103225 2.822
## Job.LocationMD 0.336930 0.101019 3.335
## Job.LocationMI 0.469252 0.144102 3.256
## Job.LocationMN 0.466280 0.192446 2.423
## Job.LocationMO 0.356299 0.131078 2.718
## Job.LocationNC 0.297738 0.116273 2.561
## Job.LocationNE 0.160604 0.162177 0.990
## Job.LocationNJ 0.482334 0.115494 4.176
## Job.LocationNM -0.091422 0.167682 -0.545
## Job.LocationNY 0.345844 0.102488 3.374
## Job.LocationOH 0.330631 0.118684 2.786
## Job.LocationOR 0.277846 0.156253 1.778
## Job.LocationPA 0.265526 0.113046 2.349
## Job.LocationRI 0.679800 0.263015 2.585
## Job.LocationSC 0.129786 0.249392 0.520
## Job.LocationTN 0.072963 0.121036 0.603
## Job.LocationTX 0.273732 0.109791 2.493
## Job.LocationUT 0.460476 0.149605 3.078
## Job.LocationVA 0.278813 0.101778 2.739
## Job.LocationWA 0.280368 0.122625 2.286
## Job.LocationWI 0.196728 0.129778 1.516
## Python 0.079923 0.024322 3.286
## aws 0.004677 0.025019 0.187
## excel 0.010542 0.020224 0.521
## sql -0.043798 0.025591 -1.711
## sas 0.102320 0.037617 2.720
## flink 0.011613 0.085075 0.136
## mongo 0.170830 0.048296 3.537
## google_an -0.071536 0.089185 -0.802
## Job_titledata analitics 0.045318 0.100507 0.451
## Job_titledata engineer 0.384302 0.041712 9.213
## Job_titledata modeler 0.258836 0.119365 2.168
## Job_titledata scientist 0.419277 0.036592 11.458
## Job_titleData scientist project manager 0.065028 0.071412 0.911
## Job_titledirector 0.611658 0.111858 5.468
## Job_titlemachine learning engineer 0.554015 0.072025 7.692
## Job_titleother data related jobs 0.140856 0.089151 1.580
## Job_titleother scientist 0.234983 0.047982 4.897
## Experienceno experience required -0.084432 0.144116 -0.586
## Experiencesr 0.141686 0.145236 0.976
## Degreena 0.010001 0.023121 0.433
## DegreeP 0.040409 0.035182 1.149
## ML_skills 0.014921 0.013871 1.076
## vizualizationboth 0.245430 0.101218 2.425
## vizualizationnone 0.170059 0.096223 1.767
## vizualizationtableau 0.124925 0.099689 1.253
## data_frameworkhadoop 0.085295 0.054868 1.555
## data_frameworknone 0.025254 0.035538 0.711
## data_frameworkspark 0.028888 0.041589 0.695
## log_Comp_history -0.003586 0.014911 -0.240
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Rating 0.064719 .
## Size10000+ 0.371236
## Size1001 - 5000 0.935708
## Size201 - 500 0.769480
## Size5001 - 10000 0.298370
## Size501 - 1000 0.288136
## Size51 - 200 0.712892
## Sizeunknown 0.027077 *
## Ownership_typeCompany - Private 0.055712 .
## Ownership_typeCompany - Public 0.150134
## Ownership_typeGovernment 0.260186
## Ownership_typeHospital 0.006448 **
## Ownership_typeNonprofit Organization 0.004130 **
## Ownership_typeOther Organization 0.064475 .
## Ownership_typeSchool / School District 0.156787
## Ownership_typeSubsidiary or Business Segment 0.089724 .
## SectorAccounting & Legal 0.663240
## SectorAerospace & Defense 0.604919
## SectorAgriculture & Forestry 0.737403
## SectorArts, Entertainment & Recreation 0.029120 *
## SectorBiotech & Pharmaceuticals 0.644946
## SectorBusiness Services 0.337574
## SectorConstruction, Repair & Maintenance 0.639863
## SectorConsumer Services 0.457474
## SectorEducation 0.100210
## SectorFinance 0.346271
## SectorGovernment 0.305383
## SectorHealth Care 0.859154
## SectorInformation Technology 0.596942
## SectorInsurance 0.572006
## SectorManufacturing 0.215219
## SectorMedia 0.452703
## SectorMining & Metals 0.197067
## SectorNon-Profit 0.502964
## SectorOil, Gas, Energy & Utilities 0.125281
## SectorReal Estate 0.275927
## SectorRetail 0.744023
## SectorTelecommunications 0.171929
## SectorTransportation & Logistics 0.368648
## SectorTravel & Tourism 0.295482
## Revenue$1 to $5 million (USD) 0.062136 .
## Revenue$10 to $25 million (USD) 0.038083 *
## Revenue$10+ billion (USD) 0.060039 .
## Revenue$100 to $500 million (USD) 0.818238
## Revenue$2 to $5 billion (USD) 0.227958
## Revenue$25 to $50 million (USD) 0.382532
## Revenue$5 to $10 billion (USD) 0.491531
## Revenue$5 to $10 million (USD) 0.015292 *
## Revenue$50 to $100 million (USD) 0.687082
## Revenue$500 million to $1 billion (USD) 0.388074
## RevenueLess than $1 million (USD) 0.701701
## RevenueUnknown / Non-Applicable 0.749135
## Hourly 3.84e-05 ***
## Job.LocationAZ 0.482668
## Job.LocationCA 8.62e-07 ***
## Job.LocationCO 0.001952 **
## Job.LocationCT 0.583926
## Job.LocationDC 0.003413 **
## Job.LocationDE 0.021695 *
## Job.LocationFL 0.259171
## Job.LocationGA 0.715304
## Job.LocationIA 0.204193
## Job.LocationID 0.285649
## Job.LocationIL 0.000575 ***
## Job.LocationIN 0.300442
## Job.LocationKS 0.723352
## Job.LocationKY 9.35e-05 ***
## Job.LocationLA 0.519165
## Job.LocationMA 0.004923 **
## Job.LocationMD 0.000903 ***
## Job.LocationMI 0.001190 **
## Job.LocationMN 0.015681 *
## Job.LocationMO 0.006747 **
## Job.LocationNC 0.010682 *
## Job.LocationNE 0.322412
## Job.LocationNJ 3.39e-05 ***
## Job.LocationNM 0.585805
## Job.LocationNY 0.000786 ***
## Job.LocationOH 0.005502 **
## Job.LocationOR 0.075862 .
## Job.LocationPA 0.019145 *
## Job.LocationRI 0.009975 **
## Job.LocationSC 0.602963
## Job.LocationTN 0.546850
## Job.LocationTX 0.012918 *
## Job.LocationUT 0.002176 **
## Job.LocationVA 0.006331 **
## Job.LocationWA 0.022566 *
## Job.LocationWI 0.130056
## Python 0.001073 **
## aws 0.851773
## excel 0.602373
## sql 0.087497 .
## sas 0.006709 **
## flink 0.891473
## mongo 0.000435 ***
## google_an 0.422799
## Job_titledata analitics 0.652223
## Job_titledata engineer < 2e-16 ***
## Job_titledata modeler 0.030503 *
## Job_titledata scientist < 2e-16 ***
## Job_titleData scientist project manager 0.362858
## Job_titledirector 6.59e-08 ***
## Job_titlemachine learning engineer 5.70e-14 ***
## Job_titleother data related jobs 0.114624
## Job_titleother scientist 1.24e-06 ***
## Experienceno experience required 0.558180
## Experiencesr 0.329664
## Degreena 0.665484
## DegreeP 0.251170
## ML_skills 0.282502
## vizualizationboth 0.015602 *
## vizualizationnone 0.077661 .
## vizualizationtableau 0.210625
## data_frameworkhadoop 0.120558
## data_frameworknone 0.477579
## data_frameworkspark 0.487561
## log_Comp_history 0.810034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2246 on 622 degrees of freedom
## Multiple R-squared: 0.6943, Adjusted R-squared: 0.6363
## F-statistic: 11.97 on 118 and 622 DF, p-value: < 2.2e-16
confint(logfit)
## 2.5 % 97.5 %
## (Intercept) 3.387333290 4.579627381
## Rating -0.001943635 0.065426517
## Size10000+ -0.224894194 0.084099609
## Size1001 - 5000 -0.135539716 0.124839950
## Size201 - 500 -0.143018889 0.105862283
## Size5001 - 10000 -0.233413320 0.071702168
## Size501 - 1000 -0.192931830 0.057406757
## Size51 - 200 -0.097344296 0.142262308
## Sizeunknown 0.037300111 0.618963372
## Ownership_typeCompany - Private -0.511466521 0.006182750
## Ownership_typeCompany - Public -0.461386596 0.070868774
## Ownership_typeGovernment -0.597744742 0.161835848
## Ownership_typeHospital -0.755872030 -0.123837734
## Ownership_typeNonprofit Organization -0.650701585 -0.122944941
## Ownership_typeOther Organization -0.633525544 0.018530363
## Ownership_typeSchool / School District -0.698638789 0.112831420
## Ownership_typeSubsidiary or Business Segment -0.503719398 0.036339960
## SectorAccounting & Legal -0.687176272 0.437642833
## SectorAerospace & Defense -0.427068574 0.248900279
## SectorAgriculture & Forestry -0.695022443 0.492222771
## SectorArts, Entertainment & Recreation -0.927051366 -0.049837929
## SectorBiotech & Pharmaceuticals -0.385723169 0.239051078
## SectorBusiness Services -0.481325275 0.165311100
## SectorConstruction, Repair & Maintenance -0.539762932 0.331964159
## SectorConsumer Services -0.554070817 0.249749727
## SectorEducation -0.650526955 0.057212868
## SectorFinance -0.481469825 0.169177467
## SectorGovernment -0.538187334 0.168856151
## SectorHealth Care -0.351602099 0.293303010
## SectorInformation Technology -0.400125702 0.230283296
## SectorInsurance -0.403519637 0.223106126
## SectorManufacturing -0.538063564 0.121432320
## SectorMedia -0.521866647 0.233027386
## SectorMining & Metals -0.711665877 0.147017574
## SectorNon-Profit -0.519218902 0.254988514
## SectorOil, Gas, Energy & Utilities -0.765179416 0.093763159
## SectorReal Estate -0.578874155 0.165518854
## SectorRetail -0.389026764 0.278057283
## SectorTelecommunications -0.646243026 0.115650135
## SectorTransportation & Logistics -0.530079530 0.196989548
## SectorTravel & Tourism -0.580991167 0.176898607
## Revenue$1 to $5 million (USD) -0.418642823 0.010388132
## Revenue$10 to $25 million (USD) -0.271639869 -0.007701451
## Revenue$10+ billion (USD) -0.004811132 0.231853428
## Revenue$100 to $500 million (USD) -0.114367565 0.090395114
## Revenue$2 to $5 billion (USD) -0.191209533 0.045649889
## Revenue$25 to $50 million (USD) -0.197138954 0.075720035
## Revenue$5 to $10 billion (USD) -0.203786519 0.098010274
## Revenue$5 to $10 million (USD) 0.043339003 0.406800326
## Revenue$50 to $100 million (USD) -0.095437495 0.144723066
## Revenue$500 million to $1 billion (USD) -0.171515499 0.066729438
## RevenueLess than $1 million (USD) -0.372312225 0.250734566
## RevenueUnknown / Non-Applicable -0.111617247 0.080344605
## Hourly -0.519862271 -0.185716042
## Job.LocationAZ -0.159430900 0.337004187
## Job.LocationCA 0.297466465 0.685954747
## Job.LocationCO 0.145173109 0.642295969
## Job.LocationCT -0.221108865 0.392251664
## Job.LocationDC 0.121051541 0.608468558
## Job.LocationDE 0.056150929 0.709241232
## Job.LocationFL -0.093899860 0.348091719
## Job.LocationGA -0.216948708 0.315976087
## Job.LocationIA -0.102083053 0.476675703
## Job.LocationID -0.172596272 0.584689263
## Job.LocationIL 0.160853588 0.582783662
## Job.LocationIN -0.117820329 0.381133055
## Job.LocationKS -0.312806855 0.450450007
## Job.LocationKY 0.293990638 0.880468773
## Job.LocationLA -0.208789162 0.413021221
## Job.LocationMA 0.088604729 0.494026232
## Job.LocationMD 0.138550190 0.535310404
## Job.LocationMI 0.186265882 0.752238571
## Job.LocationMN 0.088356688 0.844203632
## Job.LocationMO 0.098888895 0.613708888
## Job.LocationNC 0.069402644 0.526072361
## Job.LocationNE -0.157876737 0.479085423
## Job.LocationNJ 0.255529921 0.709138957
## Job.LocationNM -0.420713428 0.237869703
## Job.LocationNY 0.144579393 0.547107772
## Job.LocationOH 0.097560238 0.563701869
## Job.LocationOR -0.029001032 0.584692656
## Job.LocationPA 0.043527700 0.487524619
## Job.LocationRI 0.163295343 1.196305391
## Job.LocationSC -0.359966179 0.619538484
## Job.LocationTN -0.164726664 0.310652568
## Job.LocationTX 0.058126754 0.489338002
## Job.LocationUT 0.166683807 0.754269030
## Job.LocationVA 0.078942924 0.478683251
## Job.LocationWA 0.039560008 0.521176860
## Job.LocationWI -0.058127033 0.451583939
## Python 0.032160029 0.127685093
## aws -0.044454573 0.053808306
## excel -0.029173249 0.050256771
## sql -0.094053606 0.006457681
## sas 0.028449007 0.176191438
## flink -0.155457059 0.178682092
## mongo 0.075987969 0.265672516
## google_an -0.246676336 0.103604642
## Job_titledata analitics -0.152056715 0.242692906
## Job_titledata engineer 0.302389520 0.466214867
## Job_titledata modeler 0.024429655 0.493243255
## Job_titledata scientist 0.347417429 0.491136954
## Job_titleData scientist project manager -0.075209883 0.205265737
## Job_titledirector 0.391992927 0.831323929
## Job_titlemachine learning engineer 0.412573117 0.695457597
## Job_titleother data related jobs -0.034218488 0.315930187
## Job_titleother scientist 0.140757166 0.329209525
## Experienceno experience required -0.367443966 0.198579970
## Experiencesr -0.143526580 0.426898155
## Degreena -0.035403405 0.055405854
## DegreeP -0.028680267 0.109498616
## ML_skills -0.012319693 0.042161085
## vizualizationboth 0.046659162 0.444200625
## vizualizationnone -0.018902274 0.359020004
## vizualizationtableau -0.070843327 0.320692916
## data_frameworkhadoop -0.022452801 0.193043286
## data_frameworknone -0.044534265 0.095042783
## data_frameworkspark -0.052784407 0.110560963
## log_Comp_history -0.032867637 0.025695951
# QQ plot of the residuals
qqnorm(residuals(logfit))
qqline(residuals(logfit))
# Diagnostic plots
par(mfrow = c(2, 3))
# Residuals vs Fitted Values
plot(logfit, which = 1)
# Scale-Location (also called Spread-Location)
plot(logfit, which = 3)
## Warning: not plotting observations with leverage one:
## 290, 449
# Cook's distance plot
plot(logfit, which = 4)
# Residuals vs Leverage
plot(logfit, which = 5)
## Warning: not plotting observations with leverage one:
## 290, 449
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
# Histogram of residuals
hist(resid(logfit))
Independence: We assume that each job posting on Glassdor is independent. The log transformation does not affect the independence between the data points.
Linearity: Even though the linearity between the salary an the predictors individual not very strong (0.324 as the highest correlation), we assume that there is a linearity between the average salary with all of the predictors remained in the model.
Normality: The distribution of the residuals is less skewed. We assume that the normality is met.
Homoscedasticity: The log transformation also improved the inconstant variance issue. The residuals in the residual vs. fitted value plot appear to form a random cloud around 0.
There are two observations 290 and 449 appeared to have leverage higher than 1. After checking the data set to see if there is any error in the in the data collecting process, we did not see any reasons or explanations why these two observations were outliers. Their average salary in close to the mean salary, their predictor values were also not in the extreme sides. We also fitted the models without the outliers and see no much of a difference in our p-values or the coefficients. Therefore, we decided to keep these outliers and will interpret the model with the logged salary and all of the predictors in the current data set.
The reference for the interpretation of the coefficients: a full-time junior analyst job from a college/ university organization in Alabama. We don’t know the revenue, the sector but we know the company size is about 1-50 employees. The requirement for this job is master degree and have to know both bi and tableau.
There is no enough significant evidence to suggest that the company rating or the company age is associated with the change in the median salary, keeping the other predictors constant (p-value = 0.064 and 0.81 respectively).
However, there are overwhelming evidence to suggest that some locations are associated with the change in the median salary. For example, for same job as our reference holding all of the other variables but if the job is in California, the median salary would have a 63% increase (p-value. <0.0001). A 95% confidence interval for this increase is from 35%-99% in the median salary.
Similarly, there is evidence that knowing Python is an advantage. For the sam job with everything else held constant as the reference, additional requirement of Python is associated with 8% increase in the median salary (p-value = 0.001). A 95% confidence interval of this increase is from 3% to 14%.
Also, as we all wanted to know if the data scientist is a good investment. There is strong evidence to suggest that the data scientist job title is associated with the higher pay than an analyst job even if other variables are unchanged from the reference (p-value <0.0001). It is estimated that the median salary of data scientist job title is 52% higher than median salary of a analyst job, holding the other predictors constant. A 95% confidence interval for this increase is from 41% to 63%.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif_values <- vif(logfit)
# Display the VIF values
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Rating 2.773333e+00 1 1.665333
## Size 7.176802e+02 7 1.599540
## Ownership_type 9.851103e+02 8 1.538483
## Sector 3.040314e+06 24 1.364774
## Revenue 3.490705e+03 12 1.404822
## Hourly 3.331298e+00 1 1.825184
## Job.Location 2.277312e+05 36 1.186885
## Python 2.164492e+00 1 1.471221
## aws 1.664754e+00 1 1.290253
## excel 1.498627e+00 1 1.224184
## sql 2.402871e+00 1 1.550120
## sas 1.686041e+00 1 1.298476
## flink 1.415083e+00 1 1.189573
## mongo 1.624981e+00 1 1.274747
## google_an 2.165237e+00 1 1.471474
## Job_title 9.586276e+01 9 1.288521
## Experience 1.715823e+00 2 1.144506
## Degree 3.224348e+00 2 1.340017
## ML_skills 1.690143e+00 1 1.300055
## vizualization 4.322933e+00 3 1.276330
## data_framework 4.582293e+00 3 1.288785
## log_Comp_history 3.165049e+00 1 1.779058
library(leaps)
set.seed(123)
reg.fwd=regsubsets(log(AvgSalary)~.,data=data,method="forward",nvmax=40)
summary(reg.fwd)$adjr2
## [1] 0.1692560 0.2756362 0.3344676 0.3843202 0.4270107 0.4610551 0.4845899
## [8] 0.5157771 0.5329167 0.5433114 0.5531788 0.5630019 0.5714122 0.5785861
## [15] 0.5849125 0.5899307 0.5944495 0.5988805 0.6019873 0.6053165 0.6081596
## [22] 0.6106036 0.6126111 0.6148856 0.6166975 0.6184557 0.6200230 0.6214625
## [29] 0.6228650 0.6242618 0.6257571 0.6279423 0.6297105 0.6311011 0.6323866
## [36] 0.6332987 0.6343446 0.6353733 0.6361483 0.6369464
summary(reg.fwd)$rss
## [1] 85.18064 74.17240 68.05592 62.87268 58.43366 54.88703 52.41869 49.17967
## [9] 47.37410 46.25645 45.19501 44.14079 43.23182 42.44971 41.75485 41.19316
## [17] 40.68295 40.18280 39.81635 39.42854 39.09015 38.79231 38.53857 38.25886
## [25] 38.02567 37.79831 37.59032 37.39539 37.20452 37.01459 36.81536 36.54878
## [33] 36.32369 36.13610 35.95917 35.81908 35.66618 35.51524 35.38928 35.26128
summary(reg.fwd)$bic
## [1] -125.1923 -221.1254 -278.2896 -330.3822 -378.0300 -417.8197 -445.3079
## [8] -485.9631 -507.0719 -518.1552 -528.7488 -539.6302 -548.4406 -555.3608
## [15] -560.9826 -564.4103 -567.0374 -569.5956 -569.7762 -570.4209 -570.1999
## [22] -569.2595 -567.5143 -566.3040 -564.2262 -562.0620 -559.5427 -556.7872
## [29] -553.9711 -551.1556 -548.5469 -547.3240 -545.2935 -542.5223 -539.5513
## [36] -535.8358 -532.3976 -528.9321 -524.9570 -521.0340
par(mfrow=c(1,3))
bics<-summary(reg.fwd)$bic
bics
## [1] -125.1923 -221.1254 -278.2896 -330.3822 -378.0300 -417.8197 -445.3079
## [8] -485.9631 -507.0719 -518.1552 -528.7488 -539.6302 -548.4406 -555.3608
## [15] -560.9826 -564.4103 -567.0374 -569.5956 -569.7762 -570.4209 -570.1999
## [22] -569.2595 -567.5143 -566.3040 -564.2262 -562.0620 -559.5427 -556.7872
## [29] -553.9711 -551.1556 -548.5469 -547.3240 -545.2935 -542.5223 -539.5513
## [36] -535.8358 -532.3976 -528.9321 -524.9570 -521.0340
plot(1:40,bics,type="l",ylab="BIC",xlab="# of predictors")
index<-which(bics==min(bics))
points(index,bics[index],col="red",pch=10)
adjr2<-summary(reg.fwd)$adjr2
plot(1:40,adjr2,type="l",ylab="Adjusted R-squared",xlab="# of predictors")
index<-which(adjr2==max(adjr2))
points(index,adjr2[index],col="red",pch=10)
rss<-summary(reg.fwd)$rss
plot(1:40,rss,type="l",ylab="train RSS",xlab="# of predictors")
index<-which(rss==min(rss))
points(index,rss[index],col="red",pch=10)
coef(reg.fwd,20)
## (Intercept) Sizeunknown
## 3.98722404 0.38919056
## SectorArts, Entertainment & Recreation SectorBiotech & Pharmaceuticals
## -0.36916639 0.08067264
## SectorInformation Technology Revenue$10+ billion (USD)
## 0.07653221 0.13111168
## Revenue$5 to $10 million (USD) Hourly
## 0.27740529 -0.32720170
## Job.LocationCA Job.LocationIL
## 0.26350615 0.14470722
## Job.LocationMA Job.LocationMD
## 0.10443946 0.11547814
## Job.LocationNJ Job.LocationNY
## 0.25304586 0.09161447
## Python Job_titledata engineer
## 0.07122945 0.38634057
## Job_titledata scientist Job_titledirector
## 0.44838975 0.63346997
## Job_titlemachine learning engineer Job_titleother scientist
## 0.57868273 0.22483849
## Experiencesr
## 0.21739204
Notably, the forward feature selection model based on Bic has fewer predictors than the one based on the adjusted r squared or rss. Among the 20 predictors, some of them are statistical significant in the the model 2, including but not limited to Python, Job.LocationCA, and Job_titledatascientist. However, there are also predictors appeared to be statistically insignificant in our previous model but now enter the feature selection model, such as Experiencesr or SectorBiotech & Pharmaceuticals. We found this is very interesting as we know that in reality, people from senior level are paid higher than the junior level. This is evidence that we can not always trust the results right away but instead using our domain knowledge to find the appropriate predictors for our model.
We then move forward with the new subset of predictors which our group believe are practically or statistically significant to the salary. In. model 3, we will implement the glmnet technique and CV to collect the RMSE for the model comparison in objective 2.
#glmnet - Regression Coefficients
library(caret)
set.seed(123)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=10)
glmnet.logfit1<-train(log(AvgSalary)~Job_title + Hourly + Job.Location + Python +
sas + mongo + Experience + Degree,
data=data,
method="glmnet",
trControl=fitControl
)
opt.pen1<-glmnet.logfit1$finalModel$lambdaOpt
coef(glmnet.logfit1$finalModel,opt.pen1)
## 54 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.0155680205
## Job_titledata analitics 0.0805688036
## Job_titledata engineer 0.4109063120
## Job_titledata modeler 0.2109039202
## Job_titledata scientist 0.4822331191
## Job_titleData scientist project manager 0.0434483727
## Job_titledirector 0.6421384942
## Job_titlemachine learning engineer 0.5707006498
## Job_titleother data related jobs 0.2376108630
## Job_titleother scientist 0.2589587977
## Hourly -0.3793169705
## Job.LocationAZ -0.1188641425
## Job.LocationCA 0.2353480469
## Job.LocationCO 0.0121288414
## Job.LocationCT .
## Job.LocationDC 0.1566006592
## Job.LocationDE 0.1197544067
## Job.LocationFL -0.1254833908
## Job.LocationGA -0.1337735800
## Job.LocationIA -0.0905371999
## Job.LocationID .
## Job.LocationIL 0.1022425407
## Job.LocationIN -0.0878126905
## Job.LocationKS -0.1650966884
## Job.LocationKY 0.1776508822
## Job.LocationLA -0.1405475825
## Job.LocationMA 0.1449365844
## Job.LocationMD 0.1198878298
## Job.LocationMI 0.1076424850
## Job.LocationMN 0.1522937183
## Job.LocationMO 0.0335504368
## Job.LocationNC 0.0177597799
## Job.LocationNE -0.0931505853
## Job.LocationNJ 0.2133951884
## Job.LocationNM -0.2388818430
## Job.LocationNY 0.0560818741
## Job.LocationOH .
## Job.LocationOR .
## Job.LocationPA 0.0155628635
## Job.LocationRI 0.2908423291
## Job.LocationSC -0.1254685601
## Job.LocationTN -0.1213105284
## Job.LocationTX -0.0001210612
## Job.LocationUT 0.0618885752
## Job.LocationVA -0.0021349404
## Job.LocationWA -0.0694423877
## Job.LocationWI -0.0522838394
## Python 0.0717481240
## sas 0.0149484861
## mongo 0.0762741836
## Experienceno experience required .
## Experiencesr 0.2367047984
## Degreena .
## DegreeP 0.0444660748
glmnet.logfit1
## glmnet
##
## 741 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 666, 667, 667, 667, 666, 667, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0003072974 0.2493264 0.5561643 0.1951310
## 0.10 0.0030729741 0.2492737 0.5553024 0.1946357
## 0.10 0.0307297408 0.2545946 0.5396723 0.1974444
## 0.55 0.0003072974 0.2491700 0.5564643 0.1949388
## 0.55 0.0030729741 0.2490330 0.5556131 0.1942840
## 0.55 0.0307297408 0.2692501 0.4981201 0.2096308
## 1.00 0.0003072974 0.2491092 0.5565036 0.1948045
## 1.00 0.0030729741 0.2491404 0.5552192 0.1944404
## 1.00 0.0307297408 0.2816638 0.4657545 0.2217544
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda = 0.003072974.
plot(glmnet.logfit1)
The best RMSE from model 3 is 0.2490 which is about 28% than the actual
salary. Notice that the alpha is 0.55, and the lambda or the penalty
term is 0.003. We will check for mother EDA to consider the appropriate
interation term and ad more complexity to our model.
ggplot(data=data,aes(x=Job.Location,y=log(AvgSalary),colour=Job_title))+geom_boxplot()
It is hard to say whether the job titles have different slope in this boxplot. Some states we barely have any formation, like Kentucky or South Carolina.
ggplot(data=data,aes(x=Job_title,y=log(AvgSalary),colour=Experience))+geom_boxplot()
However, we believe there is some interaction between the job titles and the experience levels. In some job, there is a significant difference in salary between the the jobs required the senior experience level and the jobs do not require a specific experience level, like data scientist and analyst. Whereas, with the jobs that have need some type of data skills, the median pay between senior experience required and no experience required are almost the same.
ggplot(data=data,aes(x=Job_title,y=log(AvgSalary),colour=Degree))+geom_boxplot()
ggplot(data=data,aes(x=Experience,y=log(AvgSalary),colour=Degree))+geom_boxplot()
Similarly, we believe that there may be an interaction between the
experience level and the degree required. For the jobs that need the
senior experience level, there is not much of a difference in the median
salary if the jobs require different level of education. It is not the
case for the jobs that do not require a specific level of seniority, as
the higher degree are evidently associated with the higher pay.
ggplot(data=data,aes(x=Ownership_type,y=log(AvgSalary),colour=Degree))+geom_boxplot()
After checking a few more additional EDA, we want to add the complexity
in our models for the prediction purpose.
Based on the previous EDA and models, we will fit the model 4 with the same subset of predictors of model 3 with the additional interactions between Experience, Degree, and Job titles.
#glmnet - Regression Coefficients
library(caret)
set.seed(123)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=10)
glmnet.logfit2<-train(log(AvgSalary)~ Hourly + Job.Location + Python + Sector + Revenue + sas + mongo + Job_title * Experience *Degree,
data=data,
method="glmnet",
trControl=fitControl
)
opt.pen2<-glmnet.logfit2$finalModel$lambdaOpt
coef(glmnet.logfit2$finalModel,opt.pen2)
## 166 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.1159900843
## Hourly -0.2652120322
## Job.LocationAZ -0.1365085214
## Job.LocationCA 0.1999122557
## Job.LocationCO 0.0098332731
## Job.LocationCT -0.0949371974
## Job.LocationDC 0.0894485958
## Job.LocationDE 0.1001202861
## Job.LocationFL -0.1431111175
## Job.LocationGA -0.1086986009
## Job.LocationIA -0.0752283689
## Job.LocationID -0.0421720950
## Job.LocationIL 0.0651836034
## Job.LocationIN -0.0886974046
## Job.LocationKS -0.0759943491
## Job.LocationKY 0.2114146064
## Job.LocationLA -0.1730990154
## Job.LocationMA 0.0510232405
## Job.LocationMD 0.0668571219
## Job.LocationMI 0.1040172089
## Job.LocationMN 0.1498702836
## Job.LocationMO .
## Job.LocationNC .
## Job.LocationNE -0.1372046261
## Job.LocationNJ 0.1729710066
## Job.LocationNM -0.2469927314
## Job.LocationNY 0.0445769409
## Job.LocationOH .
## Job.LocationOR .
## Job.LocationPA -0.0020659984
## Job.LocationRI 0.1509434295
## Job.LocationSC .
## Job.LocationTN -0.1586873775
## Job.LocationTX .
## Job.LocationUT 0.0568406203
## Job.LocationVA .
## Job.LocationWA .
## Job.LocationWI -0.0094422985
## Python 0.0615145510
## SectorAccounting & Legal .
## SectorAerospace & Defense .
## SectorAgriculture & Forestry .
## SectorArts, Entertainment & Recreation -0.4604294668
## SectorBiotech & Pharmaceuticals 0.0474216772
## SectorBusiness Services -0.0207772932
## SectorConstruction, Repair & Maintenance .
## SectorConsumer Services -0.0916690034
## SectorEducation -0.0690613632
## SectorFinance -0.0373573528
## SectorGovernment -0.1847200562
## SectorHealth Care -0.0217456362
## SectorInformation Technology 0.0424336322
## SectorInsurance .
## SectorManufacturing -0.0421184071
## SectorMedia 0.0664082740
## SectorMining & Metals -0.0108201730
## SectorNon-Profit -0.0983351799
## SectorOil, Gas, Energy & Utilities -0.1330184960
## SectorReal Estate .
## SectorRetail 0.0274670413
## SectorTelecommunications -0.0308000020
## SectorTransportation & Logistics -0.0354344441
## SectorTravel & Tourism -0.0979951340
## Revenue$1 to $5 million (USD) -0.0397659223
## Revenue$10 to $25 million (USD) -0.0521365560
## Revenue$10+ billion (USD) 0.1165721993
## Revenue$100 to $500 million (USD) .
## Revenue$2 to $5 billion (USD) -0.0209567634
## Revenue$25 to $50 million (USD) -0.0376156170
## Revenue$5 to $10 billion (USD) -0.0123117401
## Revenue$5 to $10 million (USD) 0.2073031253
## Revenue$50 to $100 million (USD) 0.0328939716
## Revenue$500 million to $1 billion (USD) -0.0373297995
## RevenueLess than $1 million (USD) .
## RevenueUnknown / Non-Applicable 0.0429089675
## sas 0.0676556281
## mongo 0.0816971279
## Job_titledata analitics .
## Job_titledata engineer 0.3314302082
## Job_titledata modeler .
## Job_titledata scientist 0.3608938170
## Job_titleData scientist project manager 0.0054351676
## Job_titledirector 0.3672082414
## Job_titlemachine learning engineer 0.5468054888
## Job_titleother data related jobs .
## Job_titleother scientist 0.1759767539
## Experienceno experience required .
## Experiencesr 0.1439919059
## Degreena .
## DegreeP 0.0042125175
## Job_titledata analitics:Experienceno experience required .
## Job_titledata engineer:Experienceno experience required .
## Job_titledata modeler:Experienceno experience required .
## Job_titledata scientist:Experienceno experience required .
## Job_titleData scientist project manager:Experienceno experience required 0.0001729265
## Job_titledirector:Experienceno experience required 0.0034567215
## Job_titlemachine learning engineer:Experienceno experience required .
## Job_titleother data related jobs:Experienceno experience required 0.0889889957
## Job_titleother scientist:Experienceno experience required .
## Job_titledata analitics:Experiencesr .
## Job_titledata engineer:Experiencesr .
## Job_titledata modeler:Experiencesr .
## Job_titledata scientist:Experiencesr 0.0604368288
## Job_titleData scientist project manager:Experiencesr .
## Job_titledirector:Experiencesr .
## Job_titlemachine learning engineer:Experiencesr .
## Job_titleother data related jobs:Experiencesr -0.0324372378
## Job_titleother scientist:Experiencesr 0.1035644150
## Job_titledata analitics:Degreena .
## Job_titledata engineer:Degreena .
## Job_titledata modeler:Degreena 0.3631055468
## Job_titledata scientist:Degreena 0.0992442766
## Job_titleData scientist project manager:Degreena .
## Job_titledirector:Degreena .
## Job_titlemachine learning engineer:Degreena .
## Job_titleother data related jobs:Degreena .
## Job_titleother scientist:Degreena -0.0453046611
## Job_titledata analitics:DegreeP .
## Job_titledata engineer:DegreeP .
## Job_titledata modeler:DegreeP .
## Job_titledata scientist:DegreeP .
## Job_titleData scientist project manager:DegreeP .
## Job_titledirector:DegreeP 0.2612175942
## Job_titlemachine learning engineer:DegreeP .
## Job_titleother data related jobs:DegreeP .
## Job_titleother scientist:DegreeP .
## Experienceno experience required:Degreena -0.0595491137
## Experiencesr:Degreena .
## Experienceno experience required:DegreeP 0.0435911779
## Experiencesr:DegreeP .
## Job_titledata analitics:Experienceno experience required:Degreena .
## Job_titledata engineer:Experienceno experience required:Degreena -0.0029401950
## Job_titledata modeler:Experienceno experience required:Degreena 0.0012451030
## Job_titledata scientist:Experienceno experience required:Degreena .
## Job_titleData scientist project manager:Experienceno experience required:Degreena .
## Job_titledirector:Experienceno experience required:Degreena .
## Job_titlemachine learning engineer:Experienceno experience required:Degreena .
## Job_titleother data related jobs:Experienceno experience required:Degreena 0.1117304716
## Job_titleother scientist:Experienceno experience required:Degreena .
## Job_titledata analitics:Experiencesr:Degreena .
## Job_titledata engineer:Experiencesr:Degreena 0.1233948832
## Job_titledata modeler:Experiencesr:Degreena .
## Job_titledata scientist:Experiencesr:Degreena .
## Job_titleData scientist project manager:Experiencesr:Degreena .
## Job_titledirector:Experiencesr:Degreena .
## Job_titlemachine learning engineer:Experiencesr:Degreena .
## Job_titleother data related jobs:Experiencesr:Degreena .
## Job_titleother scientist:Experiencesr:Degreena -0.1101920500
## Job_titledata analitics:Experienceno experience required:DegreeP .
## Job_titledata engineer:Experienceno experience required:DegreeP .
## Job_titledata modeler:Experienceno experience required:DegreeP .
## Job_titledata scientist:Experienceno experience required:DegreeP 0.0729664900
## Job_titleData scientist project manager:Experienceno experience required:DegreeP .
## Job_titledirector:Experienceno experience required:DegreeP 0.0004680340
## Job_titlemachine learning engineer:Experienceno experience required:DegreeP .
## Job_titleother data related jobs:Experienceno experience required:DegreeP .
## Job_titleother scientist:Experienceno experience required:DegreeP .
## Job_titledata analitics:Experiencesr:DegreeP .
## Job_titledata engineer:Experiencesr:DegreeP .
## Job_titledata modeler:Experiencesr:DegreeP .
## Job_titledata scientist:Experiencesr:DegreeP -0.0007041066
## Job_titleData scientist project manager:Experiencesr:DegreeP .
## Job_titledirector:Experiencesr:DegreeP .
## Job_titlemachine learning engineer:Experiencesr:DegreeP .
## Job_titleother data related jobs:Experiencesr:DegreeP .
## Job_titleother scientist:Experiencesr:DegreeP 0.0417117719
glmnet.logfit2
## glmnet
##
## 741 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 666, 667, 667, 667, 666, 667, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0003072974 0.2452159 0.5780582 0.1879434
## 0.10 0.0030729741 0.2442440 0.5783665 0.1873601
## 0.10 0.0307297408 0.2438513 0.5757182 0.1896072
## 0.55 0.0003072974 0.2443216 0.5799105 0.1873109
## 0.55 0.0030729741 0.2425565 0.5809966 0.1868317
## 0.55 0.0307297408 0.2585489 0.5422047 0.2029182
## 1.00 0.0003072974 0.2442717 0.5795284 0.1871899
## 1.00 0.0030729741 0.2416463 0.5828406 0.1867379
## 1.00 0.0307297408 0.2776270 0.4862482 0.2190794
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.003072974.
plot(glmnet.logfit2)
The best RMSE in the model 4 is 0.2416, which is about +/- 27% from the actual median salary. Notably the alpha is 1, which is more Lasso than gmnet approach. The lamnda or penalty term is the same as the model 3. The RMSE does not improve much in our opinion even with the new interaction terms.
library(rgl)
knitr::knit_hooks$set(webgl=hook_webgl)
plot(glmnet.logfit2$finalModel, xvar = "lambda", label = TRUE)
As we can see in the plot above, there are many coefficients are forced to be 0 when increasing the penalty term (alpha).
library(caret)
set.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1)
knn.fit1<-train(log(AvgSalary)~Job_title + Hourly + Job.Location + Python + Ownership_type +
sas + mongo + Experience + Degree,
data=data,
method="knn",
trControl=fitControl
)
opt.pen<-knn.fit1$finalModel$lambdaOpt
knn.fit1
## k-Nearest Neighbors
##
## 741 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 668, 667, 665, 667, 668, 666, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.2702521 0.4806371 0.2063051
## 7 0.2720496 0.4748169 0.2100936
## 9 0.2716478 0.4756803 0.2106513
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
We also tried the nonparametric approach to see if they improved the salary prediction. Unfortunately, we are not able to try the tree models as the maximum amount of level in a categorical variables are 32. There are 37 different job locations in our data set. We will then use knn test to build the next two models. The difference between the model 5 and 6 is the addition of the interaction terms between the Job titles, the Experience and the degree.
#Model 6: KNN Fit
library(caret)
set.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1)
knn.fit2<-train(log(AvgSalary)~ Hourly + Job.Location + Python + Ownership_type +
sas + mongo + Job_title* Experience * Degree,
data=data,
method="knn",
trControl=fitControl
)
opt.pen2<-knn.fit2$finalModel$lambdaOpt
knn.fit2
## k-Nearest Neighbors
##
## 741 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 668, 667, 665, 667, 668, 666, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.2679873 0.4943566 0.2074422
## 7 0.2708543 0.4812245 0.2116457
## 9 0.2701284 0.4836510 0.2117632
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Both model 5 and 6 gave similar results. Even though the addition of the interaction term brought down the RMSE, the difference is not significant( RMSE = 0.270 and 0.267 for model 5 nd 6 respectively). The model 4 with the glmnet, interaction terms still has the lowest RMSE of all.
#Initialize an randomly sampled list for prediction on AvgSalary, use each model to predict salary values
# Set seed
set.seed(123)
# Number of samples to generate
num_samples <- 1 #We can adjust as needed
# Initialize an empty data frame for dummy data with the same columns as the original data
prediction_data <- data.frame(matrix(NA, nrow = num_samples, ncol = ncol(data)))
colnames(prediction_data) <- colnames(data)
# Loop through each column to generate random values (pulled from data)
for (col_name in names(data)) {
if (col_name != "AvgSalary") {
prediction_data[[col_name]] <- sample(data[[col_name]], num_samples, replace = TRUE)
} else if (col_name == "AvgSalary") {
# Set AvgSalary column to NULL (empty) for prediction
prediction_data[[col_name]] <- rep(NA, num_samples)
}
}
head(prediction_data)
## Rating Size Ownership_type Sector
## 1 4.3 201 - 500 Company - Public Information Technology
## Revenue Hourly AvgSalary Job.Location Comp_history
## 1 $500 million to $1 billion (USD) 0 NA TX 25
## Python aws excel sql sas flink mongo google_an Job_title Experience
## 1 0 1 1 1 0 0 0 0 data scientist sr
## Degree ML_skills vizualization data_framework log_Comp_history
## 1 P 0 none none 2.302585
#Predict using GLMnet Model 3
prediction_data$Model3 <- predict(glmnet.logfit1, newdata = prediction_data)
prediction_data$Model3 <- exp(prediction_data$Model3)
#Predict using GLMnet Model 4
prediction_data$Model4 <- predict(glmnet.logfit2, newdata = prediction_data)
prediction_data$Model4 <- exp(prediction_data$Model4)
# Predict AvgSalary on the prediction_data using the knn.fit model
prediction_data$Model5 <- predict(knn.fit1, newdata = prediction_data)
prediction_data$Model5<- exp(prediction_data$Model5)
# Predict AvgSalary on the prediction_data using the knn.fit model
prediction_data$Model6 <- predict(knn.fit2, newdata = prediction_data)
prediction_data$Model6<- exp(prediction_data$Model6)
cat("Model 3 GLMNET Logged model without interaction,predicted an average salary of:", head(prediction_data$Model3), "\n")
## Model 3 GLMNET Logged model without interaction,predicted an average salary of: 118.9676
cat("Model 4 GLMNET Logged model with interaction, predicted an average salary of:",
head(prediction_data$Model4), "\n")
## Model 4 GLMNET Logged model with interaction, predicted an average salary of: 108.845
cat("Model 5 KNN model predicted an average salary of:", head(prediction_data$Model5), "\n")
## Model 5 KNN model predicted an average salary of: 131.1116
cat("Model 6 KNN model predicted an average salary of:", head(prediction_data$Model6), "\n")
## Model 6 KNN model predicted an average salary of: 130.0268
We also created an example with random values of the prediction. Notably, the model 4 with the best RMSE gave us the lowest salary.
For the objective 1, there are a some predictors are both practically and statistically significant, such as job location, job titles and hourly. There are some predictors we thought that may matter to the average salary like the company size and the advance ML_skills, however, appeared to be not statistically significant.
After a few attempt with multiple technique, including adding interaction terms and nonparametric model, we still did not have a prediction as good as we hope for. Out of 4 different model with the increased complexity, the glmnet model with interaction terms has the lowest RMSE.
Given that the data set including the job postings scraped from Glassdoor, there is not much information how they were scraped or whether the data were collected randomly. There are some association between the predictors and the change in the median salary, including but not limited to the job locations, job title, Python requirement, etc. We assume that the results may be extended to the other data job postings on Glassdoor in 2021. However, if we want to extend the result to any data jobs, more job listings must be collected from different job sites, from other states, and so on.
For the future endeavors, we want to dig in more the methodology how the job listings were scraped from Glassdoor. We also want to address the weak correlation between the salary and the predictors. Maybe more the we need to collect better predictions that have a stronger relation with the salary.